In the evolving landscape of database technology, the distinction between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) systems is fundamental. OLTP systems, designed for managing transactional data, prioritize speed and reliability in data handling. In contrast, OLAP systems are structured to facilitate complex queries and analytics, analyzing massive quantities of data and offering deep insights into data patterns over time.
By integrating OLTP and OLAP systems and facilitating the immediate accessibility of transactional data, organizations can significantly enhance their capabilities around business intelligence, predictive analytics, and operational efficiency.
CockroachDB and Google Cloud BigQuery stand out in their respective domains of OLTP and OLAP for their scalability, reliability, and performance. This blog explores how enterprises can leverage CockroachDB as an OLTP database, use BigQuery for OLAP, and further enhance their data capabilities with BigQueryML and Vertex AI for machine learning.
Online Transaction Processing (OLTP) databases are at the heart of most mission-critical business operations, handling everything from customer orders to real-time inventory management. These systems are optimized for high transaction volume, ensuring data integrity and speed in processing multiple concurrent transactions.
CockroachDB is a distributed SQL database designed for cloud services, offering a unique blend of traditional SQL database ease-of-use with the scalability and resilience of NoSQL systems. It excels in OLTP scenarios thanks to its horizontal scalability, strong consistency guarantees, and built-in resilience to node and regional failures.
By leveraging CockroachDB for OLTP, businesses can ensure that their transactional systems are not only robust and efficient but also ready to meet future demands. You can read more about CockroachDB’s architecture here.
Online Analytical Processing (OLAP) systems are designed for query-intensive data analysis, offering insights into historical data across various dimensions. Unlike OLTP systems that focus on fast transaction processing, OLAP is optimized for complex queries and analytics, enabling businesses to derive meaningful insights from their data.
BigQuery stands as a leading OLAP solution, delivering unparalleled query performance, scalability, and operational efficiency for enterprise-grade data warehousing and analytics initiatives. BigQuery’s serverless architecture allows analysts and data scientists to run queries on massive datasets without managing underlying infrastructure, making it an ideal platform for exploring and analyzing transactional data processed by OLTP systems like CockroachDB.
By leveraging BigQuery for OLAP, organizations can unlock deep insights into their data, driving better business decisions and strategies.
To harness the full power of transactional data for analytical processing, integrating CockroachDB with BigQuery is essential. This integration enables the seamless flow of data from the OLTP system (CockroachDB) to the OLAP system (BigQuery) for in-depth analysis.
In the example below, we will create a pipeline to replicate data from CockroachDB to BigQuery using Google Cloud Pub/Sub. The data we’ll be using is sales data from a fictional eCommerce clothing store called Atom Fashion. Atom Fashion is using CockroachDB as the OLTP database for its online sales. As new orders are placed, the sales data is replicated to BigQuery.
Once the data is in BigQuery, the Atom Fashion Marketing team will take advantage of the BigQuery Machine Learning (BQML) integration with Vertex AI, Google Cloud’s comprehensive platform for AI, as well as Gemini Pro, Google’s most powerful multi-modal AI model. We’ll ask the model to generate customized emails to a specific subset of customers offering them a discount on future purchases based on their past purchase history.
Performing this email marketing campaign manually could take days or even weeks. However, using the combined power of CockroachDB and Google Cloud, you should be able to walk through this tutorial and create a successful custom email marketing campaign based on your most recent sales data in under an hour!
With data seamlessly flowing from CockroachDB to BigQuery, leveraging BigQueryML can unlock powerful predictive insights and automate decision-making processes. BigQueryML enables users to create and execute machine learning models directly within Google BigQuery using SQL queries, simplifying the path from data analytics to machine learning.
When it comes to elevating these capabilities, the integration between BigQueryML and Vertex AI, Google Cloud’s unified machine learning platform, takes it a step further. This integration provides an intuitive way for BigQueryML to handle communication with Vertex AI, thus providing a cohesive workflow for users. Vertex AI provides an end-to-end environment for building, deploying and scaling ML models while also offering tools that integrate smoothly with BigQueryML models.
BigQueryML supports various machine learning models, including linear regression for forecasting, logistic regression for classification, and k-means clustering for segmentation. Once you have your foundational model trained, Vertex AI allows you to refine and optimize these models with its advanced MLOps tools. With Vertex AI, you can also experiment with more complex models, automate the ML lifecycle, and deploy models to production with ease.
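To make this concrete, here is a minimal sketch of what a BigQueryML model definition looks like in SQL. The dataset, table, and column names (demo_dataset.orders, was_returned, and so on) are hypothetical placeholders, not objects created in this tutorial:

-- Hypothetical example: train a logistic regression model entirely in SQL.
-- The dataset, table, and column names below are placeholders.
CREATE OR REPLACE MODEL `demo_dataset.order_return_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['was_returned']
) AS
SELECT
  product_id,
  sale_price,
  was_returned
FROM `demo_dataset.orders`;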
The instructions in this guide demonstrate how to build this integration using the Google Cloud command line interface. All of these steps can be executed using the Google Cloud UI as well.
If you’d like to keep your work from this tutorial separate from other work that you have in your Google Cloud Console, you may want to consider creating a separate Google Cloud project for storing all of the objects that you’ll be creating in the following steps. The instructions for creating a new project can be found here. The examples in this document will use a project named cockroach-project. If you use a different project, please replace all references to cockroach-project in this document with your own project ID.
In this tutorial, you can set up a fully managed CockroachDB cluster on Google Cloud. You can choose between two deployment options, CockroachDB Dedicated or CockroachDB Serverless. To set up your CockroachDB Dedicated cluster on Google Cloud, you can follow these steps, or you can follow the steps here to set up a CockroachDB Serverless cluster on Google Cloud. Once you have your database cluster configured, you can move to the next step to load data onto your CockroachDB cluster.
This tutorial uses data from a fictional eCommerce store. For detailed instructions on loading this data into CockroachDB, see Appendix A at the end of this document.
The next step is to configure CockroachDB to publish changes to a Google Cloud Pub/Sub topic. This involves setting up change data capture (CDC) on the tables you want to analyze, ensuring that every insert, update, or delete is captured in real time. Read the note below and then follow the instructions in the official Cockroach Labs documentation to set up Pub/Sub as the changefeed sink for CockroachDB; a sketch of the resulting changefeed statement appears after the note.
Note: For this tutorial, we will be making the following changes to the instructions specified in the previous paragraph:
| Old Name | New Name |
|---|---|
| Topic = movr-users | Topic = order_items_crdb |
| Subscription = movr-users-sub | Subscription = order_items_crdb_sub |
You’ll make these changes in Step #8 and Step #10 of those instructions.
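For reference, here is a minimal sketch of what the changefeed setup from those instructions looks like once this tutorial's names are substituted in. The base64-encoded service account key is a placeholder, and depending on your CockroachDB version the sink URI may also require a region parameter:

-- Changefeeds require rangefeeds to be enabled on the cluster.
SET CLUSTER SETTING kv.rangefeed.enabled = true;

-- Stream changes from order_items to the order_items_crdb Pub/Sub topic.
-- <BASE64_ENCODED_KEY> is a placeholder; some versions also need &region=<gcp-region>.
CREATE CHANGEFEED FOR TABLE order_items
INTO 'gcpubsub://cockroach-project?topic_name=order_items_crdb&AUTH=specified&CREDENTIALS=<BASE64_ENCODED_KEY>';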
In the following step, we will be creating a table in BigQuery to hold the data that will be replicated from CockroachDB. We’ll create a new dataset named cockroach_pubsub_data to keep the BigQuery objects that will be created in this tutorial organized.
bq --location=US mk --dataset \
--default_table_expiration=0 \
cockroach-project:cockroach_pubsub_data
Now that we have a dataset to keep us organized, we’ll create a table named order_items_data to hold the data that Pub/Sub will be writing to BigQuery. The table will contain rows from the CockroachDB table order_items as a single JSON object. We will add a few additional columns for storing Pub/Sub metadata.
bq mk --table \
cockroach_pubsub_data.order_items_data \
subscription_name:STRING,\
message_id:STRING,\
publish_time:TIMESTAMP,\
data:STRING,\
attributes:STRING
The Pub/Sub Service Account needs to be granted permissions to interact with the new BigQuery dataset:
gcloud projects add-iam-policy-binding cockroach-project \
--member=serviceAccount:service-$(gcloud projects describe cockroach-project --format="value(projectNumber)")@gcp-sa-pubsub.iam.gserviceaccount.com \
--role=roles/bigquery.dataEditor
gcloud projects add-iam-policy-binding cockroach-project \
--member=serviceAccount:service-$(gcloud projects describe cockroach-project --format="value(projectNumber)")@gcp-sa-pubsub.iam.gserviceaccount.com \
--role=roles/bigquery.user
Create a subscription in Google Cloud Pub/Sub that consumes messages from the topic set up in Step #2 above (order_items_crdb). This subscription will write all of the changes made to the order_items table in CockroachDB directly into BigQuery.
gcloud pubsub subscriptions create order_items_crdb_bq_sub \
--project=cockroach-project \
--topic=order_items_crdb \
--bigquery-table=cockroach-project:cockroach_pubsub_data.order_items_data \
--write-metadata
In order to simulate the flow of data into the CockroachDB table order_items, we’ll insert a subset of the data that is in the order_items_stage table. Inserting the data into the order_items table will cause the changefeed that you created on the order_items table to stream all of the newly inserted rows to the order_items_crdb Pub/Sub topic. The subscription that you created in the previous step, order_items_crdb_bq_sub, will take all of the data from the Pub/Sub topic and insert it into the order_items_data table in BigQuery.
INSERT INTO public.order_items
SELECT *
FROM public.order_items_stage
WHERE id >= 1
  AND id < 1001;
As you continue your testing, you can run this statement multiple times using different ranges on the id column to simulate more order data flowing through the system.
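To confirm that the replicated rows are landing in BigQuery, you can run a quick sanity check against the order_items_data table created earlier. A minimal sketch (run it in the BigQuery console or a notebook cell):

-- Count the replicated messages and check when the most recent one arrived.
SELECT
  COUNT(*) AS row_count,
  MAX(publish_time) AS latest_message
FROM `cockroach-project.cockroach_pubsub_data.order_items_data`;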
Now that we have the changefeed operational and order data flowing into BigQuery, we can take advantage of BigQuery’s integrated machine learning functionality (BQML) and the BQML integration with Vertex AI to create a marketing campaign that utilizes the power of Gemini Pro and Generative AI.
Create an external connection to Cloud Resources and name it crdb_bqml. An external connection allows BigQuery to call other GCP services, such as Vertex AI. Also install the jq package, which will be used later for extracting JSON data.
Use the +Code button to add a new code block to the notebook:
# Double check the data is in the dataset
# Create a BigQuery Connection to use Gemini Pro, UPDATE THE PROJECT_ID
!bq mk \
--connection \
--location=US \
--project_id=cockroach-project \
--connection_type=CLOUD_RESOURCE crdb_bqml
# install jq for json extraction
!apt-get install -y jq
# from the output, copy the service account email
!bq show --format=json --connection cockroach-project.us.crdb_bqml | jq '.cloudResource.serviceAccountId'
The output will look similar to this:
"bqcx-7504964XXXX-5odo@gcp-sa-bigquery-condel.iam.gserviceaccount.com"
Grant the aiplatform.user role to the Service Account. This allows the Service Account to use Vertex AI services.
# Update the SERVICE_ACCOUNT field
!gcloud projects add-iam-policy-binding cockroach-project \
--member='serviceAccount:<SERVICE ACCOUNT EMAIL FROM THE PREVIOUS STEP>' \
--role='roles/aiplatform.user' \
--condition=None
Make sure to use the %%bigquery cell magic as the first line of the cell:
%%bigquery
CREATE OR REPLACE MODEL `cockroach-project.cockroach_pubsub_data.crdb_bq_model`
REMOTE WITH CONNECTION `us.crdb_bqml`
OPTIONS (ENDPOINT = 'gemini-pro');
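Optionally, before running the full campaign query, you can verify that the remote model can reach Gemini Pro with a single-row prompt. This is a minimal smoke-test sketch using the model created above; the prompt text is arbitrary:

%%bigquery
-- Quick smoke test: send a single prompt through the remote model.
SELECT ml_generate_text_llm_result
FROM ML.GENERATE_TEXT(
  MODEL `cockroach-project.cockroach_pubsub_data.crdb_bq_model`,
  (SELECT 'Reply with the single word OK.' AS prompt),
  STRUCT(0.2 AS temperature, 32 AS max_output_tokens, TRUE AS flatten_json_output)
);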
%%bigquery
WITH item_order AS (
SELECT
json_extract(data, "$.Value.after.product_id") as product_id,
json_extract(data, "$.Value.after.user_id") as user_id
FROM `cockroach-project.cockroach_pubsub_data.order_items_data`
),
stg_pre_ai AS (
SELECT
user_id,
name as product_name,
brand,
first_name,
last_name
FROM item_order
JOIN `bigquery-public-data.thelook_ecommerce.products` AS product
ON item_order.product_id = CAST(product.id AS STRING)
JOIN `bigquery-public-data.thelook_ecommerce.users` AS users
ON item_order.user_id = CAST(users.id AS STRING)
GROUP BY 1,2,3,4,5
)
SELECT
ml_generate_text_llm_result,
product_name,
prompt
FROM
ML.GENERATE_TEXT(
MODEL `cockroach-project.cockroach_pubsub_data.crdb_bq_model`,
(
SELECT
CONCAT('''You are an expert marketer from Atom Fashion working on an engaging marketing campaign. Your company uses an informal tone and makes references to the beach.
Your task is to generate an html email for a specific customer.
Follow these steps:
1. Identify from the context whether the customer has bought sunglasses. If they purchased sunglasses, generate an email following the steps below. If they bought a different product, do not generate an email and return NO_MATCH.
2. If the customer purchased sunglasses, congratulate them for their previous purchase and offer a 5% discount (code SUPERFUN5)
3. If the customer has not purchased sunglasses, return an empty html.
4. The email should not have more than 500 tokens in length.
5. Return the code as plain html code
6. The email should begin with <html> and end with </html>
Always think step by step.
context:\nProduct Name:''', product_name,'''\nCustomer Name:''', first_name, '''\n HTML CODE: ''') AS prompt,
product_name, first_name, last_name
FROM
stg_pre_ai
LIMIT 200
),
STRUCT(
0.2 AS temperature,
500 AS max_output_tokens,
true AS flatten_json_output
)
)
WHERE ml_generate_text_llm_result like '%html%'
The strategic integration between CockroachDB, Pub/Sub, BigQuery, and Vertex AI offers a powerful pathway for enterprises to extract actionable insights from their transactional data and make quick decisions. CockroachDB ensures robust, scalable, and consistent transactional data management, and its integration with the Google Cloud ecosystem sets the stage for deep analytical processing. This lays the foundation to leverage BigQueryML to build machine learning models within the data warehouse itself, requiring only SQL proficiency.
Vertex AI extends this capability by offering tools to refine, deploy, and scale machine learning models, thereby enhancing applications with predictive insights and AI-driven functionality. This approach not only streamlines the journey from the data collected during the initial transaction to analytical insight, but also allows enterprises to realize the full potential of their data. By adopting the capabilities of CockroachDB, BigQueryML, and Vertex AI, enterprises can make a significant leap in operational agility and strategic foresight.
If you’d like to try this example yourself, the data that is being used comes from the BigQuery public dataset bigquery-public-data.thelook_ecommerce. The data from the table bigquery-public-data.thelook_ecommerce.order_items will be exported from BigQuery to a .csv file and then imported into CockroachDB.
NOTE: The data in this table is regenerated daily, so expect some variation between the output in this document and the actual output that you receive.
Create a bucket in Google Cloud storage for the data that you will be exporting from the BigQuery Public Dataset.
gcloud storage buckets create gs://cockroach-sample-data --project=cockroach-project --location=US
Export the data from the BigQuery Public Dataset into the bucket that you created in the previous step.
bq extract --destination_format=CSV \
bigquery-public-data:thelook_ecommerce.order_items \
gs://cockroach-sample-data/order_items.csv
In order for CockroachDB to be able to access the CSV file in your Google Cloud bucket, you need to create a Service Account with sufficient permissions for CockroachDB to access the data.
gcloud iam service-accounts create cockroach-bigquery-import \
--project cockroach-project
Add the permissions required for CockroachDB to read the exported data from the bucket.
gsutil iam ch \
serviceAccount:cockroach-bigquery-import@cockroach-project.iam.gserviceaccount.com:objectViewer,legacyBucketReader \
gs://cockroach-sample-data
Create a JSON key file for your service account. This key file is used by CockroachDB for accessing the data in your Google Cloud Storage bucket. Running this command will create the JSON key file cockroach_bigquery_export_key.json in your current working directory.
gcloud iam service-accounts keys create cockroach_bigquery_export_key.json --project=cockroach-project --iam-account=cockroach-bigquery-import@cockroach-project.iam.gserviceaccount.com
Create a table to store the data that you’ll be importing by running the following command in the SQL shell or any other SQL editor you prefer.
CREATE TABLE public.order_items_stage (
id INT8 NOT NULL,
order_id INT8 NULL,
user_id INT8 NULL,
product_id INT8 NULL,
inventory_item_id INT8 NULL,
status STRING NULL,
created_at TIMESTAMP NULL,
shipped_at TIMESTAMP NULL,
delivered_at TIMESTAMP NULL,
returned_at TIMESTAMP NULL,
sale_price FLOAT8 NULL,
CONSTRAINT order_items_pkey PRIMARY KEY (id ASC)
);
From a terminal window on your local machine, base64-encode the key file that you created in Step 5.
On a Mac:
cat cockroach_bigquery_export_key.json | base64
On Linux: (to prevent the output lines from wrapping):
cat cockroach_bigquery_export_key.json | base64 -w 0
Import the data from your Google Cloud storage bucket into the table that you created in step 6 and provide your base64 encoded key file for authentication. The base64 encoded data follows the CREDENTIALS= in the IMPORT statement. The import may take a few minutes and the table will be unavailable while the data is being imported.
IMPORT INTO public.order_items_stage (
id, order_id, user_id, product_id, inventory_item_id, status, created_at, shipped_at, delivered_at, returned_at, sale_price)
CSV DATA ('gs://cockroach-sample-data/order_items.csv?AUTH=specified&CREDENTIALS=ewog......p9Cg==')
WITH skip = '1',
DETACHED;
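Because the import is started with the DETACHED option, it runs as a background job. Here is a sketch of how you could check its status and progress from the SQL shell:

-- The DETACHED import runs as a background job; check its status and progress.
SELECT job_id, status, fraction_completed
FROM [SHOW JOBS]
WHERE job_type = 'IMPORT'
ORDER BY created DESC
LIMIT 1;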
Create an empty table with the same structure as the order_items_stage table. This table will be configured for Change Data Capture and Pub/Sub in a later step.
CREATE TABLE public.order_items (LIKE public.order_items_stage INCLUDING ALL);